dot-product attention
Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems
The Transformer and its variants have proven to be efficient sequence learners across many different domains. Despite their staggering success, two critical issues remain: the enormous number of parameters that must be trained (ranging from $10^7$ to $10^{11}$) and the quadratic complexity of dot-product attention. In this work, we investigate approximating the two central components of the Transformer --- multi-head self-attention and point-wise feed-forward transformation --- with a reduced parameter space and computational complexity. We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations. Exploiting an analogy between Transformer stages and the evolution of a dynamical system of multiple interacting particles, we formulate a temporal evolution scheme, \name, that bypasses costly dot-product attention over multiple stacked layers. We perform exhaustive experiments with \name\ on well-known encoder-decoder as well as encoder-only tasks. We observe that the degree of approximation (or, inversely, the degree of parameter reduction) affects performance differently depending on the task: in the encoder-decoder regime, \name\ delivers performance comparable to the original Transformer, while on encoder-only tasks it consistently outperforms the Transformer and several subsequent variants.
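The two ideas the abstract combines can be sketched in a few lines of NumPy: standard scaled dot-product attention materializes an $n \times n$ score matrix (the quadratic cost the paper targets), and a residual layer $x + hF(x)$ can be read as one Euler step of the ODE $dx/dt = F(x)$, which is the dynamical-systems view the paper builds on. This is a minimal illustration of those two generic building blocks, not the paper's \name\ scheme; the step size `h` is an illustrative assumption.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard attention: the (n, n) score matrix makes the cost O(n^2 d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n) -- quadratic in n
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # row-stochastic weights
    return w @ V

def euler_layer(x, F, h=1.0):
    """A residual layer x + h*F(x), read as one Euler step of dx/dt = F(x)."""
    return x + h * F(x)

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.standard_normal((n, d))
out = euler_layer(x, lambda z: scaled_dot_product_attention(z, z, z))
print(out.shape)  # (6, 4)
```

Stacking such layers corresponds to integrating the particle system forward in time; schemes that reuse or interpolate the expensive attention term across steps are what allow the dot product to be computed less often than once per layer.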
We would like to thank all of the reviewers for their time and thoughtful comments on our paper. MAAC uses critic attention only to reduce the state-space representation; to substantiate this claim, we performed an analysis of TarMAC. More importantly, SARNet relies on a dedicated memory unit. We have described it in Appendix A.1.4 and will add further details. However, we did not see performance gains for the tasks in the paper; we will note results with gates in the revision. SARNet's performance is substantially better than the baselines when the task becomes harder (more agents).
Stochastic Clock Attention for Aligning Continuous and Ordered Sequences
We formulate an attention mechanism for continuous and ordered sequences that explicitly functions as an alignment model, which serves as the core of many sequence-to-sequence tasks. Standard scaled dot-product attention relies on positional encodings and masks but does not enforce continuity or monotonicity, which are crucial for frame-synchronous targets. We attach learned nonnegative \emph{clocks} to the source and target and model attention as the meeting probability of these clocks; a path-integral derivation yields a closed-form, Gaussian-like scoring rule with an intrinsic bias toward causal, smooth, near-diagonal alignments, without external positional regularizers. The framework supports two complementary regimes: normalized clocks for parallel decoding when a global length is available, and unnormalized clocks for autoregressive decoding; both are nearly parameter-free, drop-in replacements. In a Transformer text-to-speech testbed, this construction produces more stable alignments and improved robustness to global time-scaling while matching or exceeding the accuracy of scaled dot-product baselines. We hypothesize applicability to other continuous targets, including video and temporal-signal modeling.
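The core construction can be sketched generically: cumulate nonnegative increments into a monotone "clock" per sequence, then score each target-source pair by a Gaussian-like function of the clock difference, which biases the weights toward a smooth, near-diagonal alignment. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation; the softplus parameterization and the width `sigma` are illustrative choices, and the normalized regime shown assumes a known global length.

```python
import numpy as np

def monotone_clock(raw, normalize=True):
    """Cumulate nonnegative (softplus) increments into a monotone clock."""
    inc = np.log1p(np.exp(raw))        # softplus: increments strictly > 0
    clock = np.cumsum(inc)
    if normalize:                      # normalized regime: clock runs 0 -> 1
        clock = clock / clock[-1]
    return clock

def clock_alignment(src_clock, tgt_clock, sigma=0.05):
    """Gaussian-like score in clock difference -> smooth, near-diagonal weights."""
    diff = tgt_clock[:, None] - src_clock[None, :]     # (T, S) clock gaps
    scores = -0.5 * (diff / sigma) ** 2
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)           # rows sum to 1

rng = np.random.default_rng(1)
src = monotone_clock(rng.standard_normal(8))
tgt = monotone_clock(rng.standard_normal(5))
A = clock_alignment(src, tgt)
print(A.shape)  # (5, 8)
```

Because both clocks are monotone by construction, the alignment cannot jump backward in the way unconstrained dot-product scores can, which is the property the abstract highlights for frame-synchronous targets.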